08. Tokenizing Words

Words to Vectors

At this point, we know that we cannot directly feed words into an LSTM and expect it to train or produce the correct output. These words must first be turned into a numerical representation so that a network can use normal loss functions and optimizers to calculate how "close" a predicted word is to the ground-truth word (from a known training caption). So, we typically turn a sequence of words into a sequence of numerical values; a vector of numbers where each number maps to a specific word in our vocabulary.

To process words and create a vocabulary, we'll be using the Python text processing toolkit NLTK. In the video below, one of our content developers, Arpan, explains the concept of word tokenization with NLTK.

Tokenization

Later, you'll see how we take a tokenized representation of a caption to create a Python dictionary that maps unique words in our captions dataset to unique integers.